[57] ABSTRACT

Data processors, each comprising a distributed process- 
ing system, include means for executing checkpoint 
restart processing synchronously with each other. Each 
of the data processors erases old checkpoint data when 
it can execute restart processing based on new check- 
point data. When one of the data processors cannot 
execute the restart processing based on the new check- 
point data, the other of the data processors executes the 
restart processing based on the old checkpoint data 
synchronously with the one of the data processors. 
DISTRIBUTED PROCESSING SYSTEM WITH CHECKPOINT RESTART FACILITIES WHEREIN CHECKPOINT DATA IS UPDATED ONLY IF ALL PROCESSORS WERE ABLE TO COLLECT NEW CHECKPOINT DATA

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to checkpoint restart facilities installed in each of a number of data processors which are interconnected to form a distributed processing system.

2. Description of the Related Art

A distributed processing system is a system in which a plurality of data processors are interconnected by communication lines so that data transmission and reception can occur among the data processors. With such a system, the data processors can share data and execute distributed processing of an application program.

In a data processing system, checkpoint restart facilities are well known in which, when a system failure occurs in the system, processing is allowed to continue from the last checkpoint of a program which has normally been executed before the system failure occurs. The facilities save information (i.e., checkpoint data) necessary to restart the execution of a program from that point in the program execution at which the information is saved.

In the distributed processing system, each of the data processors has the checkpoint restart facilities and independently executes its checkpoint restart facilities to recover from a failure.

In the distributed processing system, however, with data transmission and reception performed between data processors, even if one of the data processors operates properly, a system failure may occur in another data processor. In such a case, the following problem will arise when each of the data processors independently performs the checkpoint restart facilities as described above. That is, with data transmitted from one data processor to another data processor, when a failure occurs in the former, it will execute the checkpoint restart facilities to continue processing from a point prior to occurrence of the failure. In this case, the latter, which functions properly, does not recognize that the restart facilities have been executed in the former and thus will not identify whether data transmitted from the former is data transmitted prior to occurrence of the failure or is fresh data. Therefore, a situation in which no data necessary for the current processing is transmitted to the latter may take place, and a malfunction may occur in data communication. This will lower the reliability of the distributed processing system.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a distributed processing system with checkpoint restart facilities which allows each of a number of data processors comprising the system to execute the checkpoint restart facilities synchronously with another data processor in the system to prevent difficulties from arising in data transmission and reception among the data processors.

According to the present invention there is provided a distributed processing system with restart processing facilities comprising: communication control means, which are installed in each of a plurality of data processors constituting said system and which execute a predetermined program, for allowing transmission and reception of data between said data processors; checkpoint processing means for causing one of said data processors to request another data processor to collect checkpoint data via said communication control means and for determining, based on the response from the other data processor, the values of checkpoint data necessary for restart processing; data storage means installed in each of said data processors for storing the checkpoint data determined by said checkpoint processing means; and restart processing means installed in each of said data processors for restarting said program on the basis of the checkpoint data stored in said data storage means.

Additional objects and advantages of the invention will be set forth in the following description, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate a presently preferred embodiment of the invention and, together with the general description given above and the detailed description of the preferred embodiment given below, serve to explain the principles of the invention.

FIG. 1 shows a distributed processing system embodying the present invention;

FIG. 2 is a conceptual block diagram illustrating the processing means in the embodiment of FIG. 1;

FIG. 3 is a flowchart for explaining the contents of the checkpoint processing in the embodiment of FIG. 1;

FIG. 4 is a flowchart for explaining the contents of the failure processing in the embodiment of FIG. 1; and

FIG. 5 is a flowchart for explaining the contents of the restart processing in the embodiment of FIG. 1.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring now to FIG. 1, a distributed processing system according to an embodiment of the present invention is comprised of a plurality of data processors 10A and 10B which are interconnected by a communication line 11 so that data is transmitted and received between the data processors. Each of data processors 10A and 10B serves as a node of a computer network.

Data processor 10A comprises: a communication control unit (CCU) 12A, a central processing unit (CPU) 13A and a memory unit 14A which stores various application programs. CCU 12A transmits and receives data over communication line 11 under the control of CPU 13A. CPU 13A executes various types of data processing, such as checkpoint processing, failure processing and restart processing, associated with the present embodiment on the basis of the programs stored in memory unit 14A. Data processor 10A is further provided with a hard-disk drive (HDD) 15A serving as a filing system. CPU 13A writes into or reads from HDD 15A via an I/O interface 16A. HDD 15A has an area for storing checkpoint data associated with the present embodiment.
Data processor 10B likewise comprises a CCU 12B, a CPU 13B, a memory unit 14B, an HDD 15B, and an I/O interface 16B. CCU 12B is connected to CCU 12A through communication line 11 to receive data from and send data to data processor 10A. CPU 13B, memory unit 14B, HDD 15B and I/O interface unit 16B have the same functions as the corresponding components in data processor 10A.

In such a system, CPU 13A of data processor 10A executes given data processing corresponding to an application program stored in memory unit 14A, while CPU 13B of data processor 10B executes given data processing corresponding to an application program stored in memory unit 14B. At this point CPU 13B is executing distributed processing of a given job together with CPU 13A. CPU 13A and CPU 13B transmit and receive data between themselves via communication line 11.

With such distributed processing executed by data processors 10A and 10B, if a system failure occurs, the processing by CPU 13A and CPU 13B is temporarily stopped. To recover from the system failure, checkpoint restart processing is executed. That is, as shown in FIG. 4, CPU 13A and CPU 13B execute a failure detecting process based on the program stored in memory units 14A and 14B (step S11) while executing an application program (step S10). When the system failure is detected by the failure detecting process, the checkpoint restart process is executed (steps S12, S13).

Next, the checkpoint restart process will be described with reference to FIGS. 2, 3 and 5.

FIG. 2 is a conceptual diagram of means for executing the checkpoint restart processing. That is, in each of data processors 10A and 10B, failure processing tasks 22A and 22B execute the system failure detecting process. Restart processing tasks 21A and 21B execute the checkpoint restart processing in response to detection of the system failure by failure processing tasks 22A and 22B. Checkpoint processing tasks 20A and 20B execute a process of determining data necessary for restart processing. Tasks 20A, 21A and 22A described herein respectively correspond to execution by CPU 13A of the checkpoint processing program, the restart processing program and the failure processing program, which are all stored in memory unit 14A. Likewise tasks 20B, 21B and 22B respectively correspond to execution by CPU 13B of the checkpoint processing program, the restart processing program and the failure processing program, which are all stored in memory unit 14B.

HDD 15A in data processor 10A has a file area 23A for storing the latest checkpoint data at the time of the occurrence of the system failure and a file area 23B for storing the old data prior to the occurrence of the system failure. Likewise HDD 15B in data processor 10B has a file area 24A for storing the latest checkpoint data at the time of the occurrence of the system failure and a file area 24B for storing the old data prior to the occurrence of the system failure.

Checkpoint processing tasks 20A and 20B execute a process as indicated in FIG. 3. For example, when the distributed processing is being executed by data processors 10A and 10B, data processor 10A executes a step S1 of requesting data processor 10B to send data associated with the distributed processing. That is, CPU 13A executes an application program and prompts CPU 13B, via communication line 11, to send data associated with the data processing. At this point checkpoint processing task 20A stores new checkpoint data in file area 23A (step S2). The contents of the checkpoint data comprise a program status word (PSW) representing variable information (e.g., parameters) necessary to make a restart of the application program which has been stopped by the system failure. At this point file area 23B stores old checkpoint data which was collected by checkpoint processing task 20A prior to the present time.

Next, checkpoint processing task 20A requests checkpoint processing task 20B of data processor 10B to collect new checkpoint data for the program executed by CPU 13B in accordance with the request to send data (step S3). That is, task 20A transmits to task 20B ID information indicating the state of the variables in the program that CPU 13A is executing and requests task 20B to collect new checkpoint data in a program to be executed by CPU 13B. Task 20B in data processor 10B collects the new checkpoint data as requested by task 20A and stores it in file area 24A of HDD 15B (step S4). At this time task 20B identifies, using the ID information from task 20A, the program that CPU 13B executes in response to the request to send from CPU 13A and collects the new checkpoint data of the program. The old checkpoint data collected previously by checkpoint processing task 20B remains stored in file area 24B of HDD 15B.

In response to reception of an ACK (acknowledgment) from task 20B of data processor 10B, task 20A of data processor 10A decides whether or not data processor 10B has normally received data from data processor 10A (step S5). In other words, task 20B transmits an ACK to task 20A when the collection of the new checkpoint data as requested by task 20A was successful, but transmits no ACK when it was unsuccessful.

When the ACK is transmitted from task 20B, task 20A erases the old checkpoint data stored in file area 23B and designates the new checkpoint data stored in file area 23A as the current checkpoint data (step S6). Likewise task 20B erases the old checkpoint data stored in file area 24B and designates the new checkpoint data stored in file area 24A as the current checkpoint data.

When no ACK is transmitted from task 20B, on the other hand, restart processing task 21A is prompted by task 20A to execute the checkpoint restart process illustrated in FIG. 5 (step S7). In this case, in data processor 10A, a failure occurs in data communication with data processor 10B and the execution of the application program by CPU 13A is stopped to execute the restart process.

Next, the checkpoint restart process will be described with reference to FIG. 5.

First, restart processing task 21A of data processor 10A reads the new checkpoint data from file area 23A (step S20). Further, task 21A makes a restart request to restart processing task 21B of data processor 10B to execute a restart process based on the corresponding new checkpoint data (step S21). Responsive to the restart request from task 21A, task 21B reads the new checkpoint data from file area 24A (step S22). On success in reading the new checkpoint data from file area 24A, task 21B transmits an ACK to task 21A. No ACK will be transmitted when the reading of the new checkpoint data from file area 24A ends in failure. On receiving the ACK from task 21B of data processor 10B, task 21A of data processor 10A decides whether or not data processor 10B has normally received data (restart request) from data processor 10A (step S23).
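
Returning briefly to the checkpoint processing of FIG. 3, steps S2 to S7 can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the implementation of the embodiment: the class `Node`, its `new` and `old` attributes (standing in for file areas 23A/23B and 24A/24B), the methods `collect` and `commit`, and the `b_fails` switch are hypothetical names, and the ACK is modeled as a boolean return value rather than a message over communication line 11. The sketch also assumes that the previously current checkpoint is retained as the old data when a new collection starts, which the description implies but does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One data processor; `new` and `old` model its two checkpoint file areas."""
    name: str
    new: Optional[dict] = None  # file area 23A (10A) or 24A (10B)
    old: Optional[dict] = None  # file area 23B (10A) or 24B (10B)

    def collect(self, state: dict, fail: bool = False) -> bool:
        """Store new checkpoint data; True models an ACK, False models no ACK."""
        if self.new is not None:
            # Assumption: the previous current checkpoint is kept as the old one.
            self.old = self.new
        if fail:
            self.new = None  # collection failed, so no usable new data exists
            return False
        self.new = dict(state)
        return True

    def commit(self) -> None:
        """Step S6: erase the old data; the new data is the current checkpoint."""
        self.old = None


def checkpoint_round(a: Node, b: Node, state_a: dict, state_b: dict,
                     b_fails: bool = False) -> bool:
    """Return True if both nodes committed, False if restart is required (S7)."""
    a.collect(state_a)                      # S2: 10A stores its new data
    ack = b.collect(state_b, fail=b_fails)  # S3/S4: 10B collects on request
    if ack:                                 # S5: was an ACK received?
        a.commit()                          # S6 on 10A
        b.commit()                          # S6 on 10B
        return True
    return False                            # S7: start restart processing
```

In this model, a failed round leaves the requesting processor holding both new and old data while the other processor holds only old data, which is the situation the restart processing of FIG. 5 has to resolve.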
On receiving the ACK from task 21B, task 21A executes the restart process on the basis of the new checkpoint data read from file area 23A (step S24). That is, CPU 13A restarts the application program which has been stopped from a restart point specified by the new checkpoint data. Likewise task 21B executes the restart process on the basis of the new checkpoint data read from file area 24A. That is, CPU 13B restarts the application program which has been stopped from a restart point specified by the new checkpoint data.

When no ACK is transmitted from task 21B, on the other hand, task 21A reads the old checkpoint data from file area 23B to execute the restart process on the basis of the old checkpoint data (steps S25 and S26). That is, CPU 13A restarts the program from a restart point specified by checkpoint data which has been collected prior to the new checkpoint data. Likewise task 21B reads the old checkpoint data from file area 24B to execute the restart process on the basis of the old checkpoint data. That is, CPU 13B restarts the program from a restart point specified by checkpoint data which has been collected prior to the new checkpoint data.
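
The restart decision of FIG. 5 (steps S20 to S26) can be sketched in Python as follows. This is a minimal illustration under assumptions rather than the embodiment's code: `FileAreas`, `read_new` and `synchronized_restart` are hypothetical names, unreadable new checkpoint data is modeled as `None`, and the ACK from task 21B is modeled as a boolean.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FileAreas:
    """Checkpoint storage of one data processor (hypothetical model)."""
    new: Optional[dict]  # file area 23A (10A) or 24A (10B)
    old: Optional[dict]  # file area 23B (10A) or 24B (10B)


def read_new(areas: FileAreas) -> bool:
    """Step S22: True models an ACK, i.e. the new data could be read."""
    return areas.new is not None


def synchronized_restart(
    a: FileAreas, b: FileAreas
) -> Tuple[Optional[dict], Optional[dict]]:
    """Return the checkpoint data from which 10A and 10B restart."""
    a_new = a.new            # S20: 10A reads its new checkpoint data
    ack = read_new(b)        # S21/S22: restart request; 10B tries to read
    if ack:                  # S23: ACK received from 10B
        return a_new, b.new  # S24: both restart from the new data
    return a.old, b.old      # S25/S26: both fall back to the old data
```

For example, with `a = FileAreas(new={"pc": 2}, old={"pc": 1})` and `b = FileAreas(new=None, old={"pc": 10})`, the function returns the old data of both processors, so 10A rolls back together with 10B even though its own new data is intact.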

In this way, when a system failure occurs in the distributed processing system having data processors 10A and 10B, data processors 10A and 10B do not execute the checkpoint restart facilities independently of one another but instead execute the checkpoint restart facilities synchronously with one another. That is, in a case where data processor 10A functions properly but a system failure occurs in data processor 10B, data processor 10A executes the restart process based on the old checkpoint data together with data processor 10B. Accordingly, when data processor 10B cannot execute the restart process based on the new checkpoint data, data processor 10A executes the restart process based on the old checkpoint data, thereby executing the data processing in synchronization with data processor 10B.

As can be seen, where a system failure occurs in one of the data processors, in a distributed processing system which needs transmission and reception of data between data processors, transmission of currently unnecessary data from the properly functioning data processor to the malfunctioning data processor can surely be avoided. In other words, it is possible to implement transmission of currently necessary data between the data processors. As a result, the occurrence of malfunctions in data communications between the data processors can be avoided, thus improving the reliability of the distributed processing system.

Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative devices, and illustrated examples shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
What is claimed is: 

1. A distributed processing system with a restart pro-
cessing function comprising:

a first data processor and a second data processor, 
each executing a distributed processing; 

communication control means, installed in each said 
first processor and said second data processor, for
allowing transmission and reception of data be- 
tween said first data processor and said second data 
processor; 




checkpoint processing means, installed in each said 
first data processor and said second data processor, 
for determining new checkpoint data as current 
checkpoint data for executing restart processing 
when an ACK signal is transmitted from said sec- 
ond data processor to said first data processor in 
response to a request to collect said new check- 
point data made by said first data processor, and 
determining old checkpoint data as said current 
checkpoint data when no ACK signal is transmit- 
ted from said second data processor in response to 
said request made by said first data processor; 

data storage means, installed in each said first data 
processor and said second data processor, for stor- 
ing the current checkpoint data determined by said 
checkpoint processing means; and 

restart processing means, installed in each said first 
data processor and said second data processor, for 
restarting execution of said distributed processing 
in accordance with said current checkpoint data 
stored in said data storage means. 

2. A system according to claim 1, further comprising 
failure detecting means, installed in each said first data 
processor and said second data processor, for detecting 
a failure in at least one of said first data processor and 
said second data processor, said restart processing 
means responding to detection of a failure by said fail- 
ure detecting means by executing said restart processing 
based on said current checkpoint data stored in said data 
storage means. 

3. A distributed processing system with a restart pro- 
cessing function comprising: 

a first data processor and a second data processor, 
each executing a distributed processing; 

communication control means, installed in each said 
first data processor and said second data processor, 
for allowing transmission and reception of data 
between said first data processor and said second 
data processor; 

first checkpoint processing means, installed in each 
said first data processor and said second data pro- 
cessor, for generating old checkpoint data and new 
checkpoint data necessary for restart processing in 
accordance with a point at which said distributed 
processing is being executed; 

data storage means, installed in each said first data 
processor and said second data processor and hav- 
ing storage areas, for storing said old checkpoint 
data and said new checkpoint data generated by 
said first checkpoint processing means;

second checkpoint processing means for determining 
said new checkpoint data as current checkpoint 
data when an ACK signal is transmitted from said 
second data processor to said first data processor in 
response to a request to collect said new check- 
point data made by said first data processor, and 
determining said old checkpoint data as said cur- 
rent checkpoint data when no ACK signal is trans- 
mitted from said second data processor in response 
to said request made by said first data processor; 
and 

restart processing means, installed in each said first 
data processor and said second data processor, for 
restarting execution of said distributed processing 
in accordance with said current checkpoint data 
stored in said data storage means. 

4. A system according to claim 3, further comprising
failure detecting means installed in each said first data
processor and said second data processor for detecting
a failure in at least one of said first data processor and
said second data processor, said restart processing
means responding to detection of a failure by said fail-
ure detecting means to execute said restart processing
based on said current checkpoint data stored in said data
storage means.
5. A distributed processing system comprising: 
a plurality of data processing apparatuses; 
communication control means, installed in each of 
said data processing apparatuses, for controlling 
transmission and reception of data between said 
data processing apparatuses; 
data storage means, installed in each of said data 
processing apparatuses, for storing new and old 
checkpoint data;
checkpoint processing means, installed in each of said 
data processing apparatuses, for storing first new 
checkpoint data in said data storage means for one 
of said data processing apparatuses and requesting 
that another of said data processing apparatuses 
collect second new checkpoint data in response to 
transferred data when said communication control 
means transfers data to said another of said data 
processing apparatuses, and for clearing old check-
point data in response to an acknowledge signal
representing that said another of said data process-
ing apparatuses collected said second new check-
point data accurately; and

restart processing means, installed in each of said data
processing apparatuses, for reading one of said first
new checkpoint data and said old checkpoint data
from said data storage means and restarting pro-
cessing based on read checkpoint data when said
checkpoint processing means does not receive said
acknowledge signal.
6. A system according to claim 5, in which said re- 
start processing means reads said first new checkpoint 
data and requests said another of said data processing
apparatuses to read said second new checkpoint data 
when said checkpoint processing means does not re- 
ceive said acknowledge signal, and restarts processing 
based on said first new checkpoint data stored in said 
data storage means in response to another acknowledge
signal representing that said another data processing 
apparatus collected said second new checkpoint data 
accurately. 